Abstract

The natural language generation (NLG) component of a spoken dialogue system (SDS) usually needs a substantial amount of handcrafting or a well-labeled dataset to be trained on. These limitations add significantly to development costs and make cross-domain, multi-lingual dialogue systems intractable. Moreover, human languages are context-aware. The most natural response should be directly learned from data rather than depending on predefined syntaxes or rules. This paper presents a statistical language generator based on a joint recurrent and convolutional neural network structure which can be trained on dialogue act-utterance pairs without any semantic alignments or predefined grammar trees. Objective metrics suggest that this new model outperforms previous methods under the same experimental conditions. Results of an evaluation by human judges indicate that it produces not only high quality but also linguistically varied utterances, which are preferred over those of n-gram and rule-based systems.

1 Introduction 

Conventional spoken dialogue systems (SDS) are expensive to build because many of the processing components require a substantial amount of handcrafting (Ward and Issar, 1994; Bohus and Rudnicky, 2009). In the past decade, significant progress has been made in applying statistical methods to automate the speech understanding and dialogue management components of an SDS, including making them more easily extensible to other application domains (Young et al., 2013; Gasic et al., 2014; Henderson et al., 2014). However, due to the difficulty of collecting semantically-annotated corpora, the use of data-driven NLG for SDS remains relatively unexplored and rule-based generation remains the norm for most systems (Cheyer and Guzzoni, 2007; Mirkovic and Cavedon, 2011).

The goal of the NLG component of an SDS is to map an abstract dialogue act consisting of an act type and a set of attribute-value pairs¹ into an appropriate surface text (see Table 1 below for some examples). An early example of a statistical NLG system is HALOGEN by Langkilde and Knight (1998), which uses an n-gram language model (LM) to rerank a set of candidates generated by a handcrafted generator. In order to reduce the amount of handcrafting and make the approach more useful in SDS, Oh and Rudnicky (2000) replaced the handcrafted generator with a set of word-based n-gram LM-based generators, one for each dialogue type, and then reranked the generator outputs using a set of rules to produce the final response. Although Oh and Rudnicky (2000)'s approach limits the amount of handcrafting to a small set of post-processing rules, their system incurs a large computational cost in the over-generation phase and it is difficult to ensure that all of the required semantics are covered by the selected output. More recently, a phrase-based NLG system called BAGEL, trained from utterances aligned with coarse-grained semantic concepts, has been described (Mairesse et al., 2010; Mairesse and Young, 2014). By implicitly modelling paraphrases, BAGEL can generate linguistically varied utterances. However, collecting semantically-aligned corpora is expensive and time consuming, which limits BAGEL's scalability to new domains.

This paper presents a neural network based NLG system that can be fully trained from dialog act-utterance pairs without any semantic alignments between the two. We start in Section 3 by presenting a generator based on a recurrent neural network language model (RNNLM) (Mikolov et al., 2010; Mikolov et al., 2011a) which is trained on a delexicalised corpus (Henderson et al., 2014) whereby each value has been replaced by a symbol representing its corresponding slot. In a final post-processing phase, these slot symbols are converted back to the corresponding slot values.

¹Here and elsewhere, attributes are frequently referred to as slots.
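The delexicalisation and post-processing steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SLOT_<NAME>` symbol format, the function names, and the plain string replacement are assumptions made for illustration, using the dialogue act from Figure 1 as input.

```python
def delexicalise(utterance, slots):
    """Replace each slot value in the utterance with a SLOT_<NAME> symbol.

    The SLOT_ prefix is a hypothetical naming convention, not taken from the paper.
    """
    out = utterance
    # Replace longer values first so that a short value nested in a longer
    # one does not break the longer match.
    for name, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        out = out.replace(value, "SLOT_" + name.upper())
    return out


def lexicalise(template, slots):
    """Post-processing step: convert slot symbols back to their values."""
    out = template
    # Longer slot names first, so one symbol is never matched inside another.
    for name, value in sorted(slots.items(), key=lambda kv: -len(kv[0])):
        out = out.replace("SLOT_" + name.upper(), value)
    return out


slots = {"name": "seven days", "food": "chinese"}
text = "seven days serves chinese food"
template = delexicalise(text, slots)
print(template)                     # SLOT_NAME serves SLOT_FOOD food
print(lexicalise(template, slots))  # seven days serves chinese food
```

String replacement is only adequate when the slot values appear verbatim in the utterance; a real corpus would likely need tokenisation and value normalisation first.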

While generating, the RNN generator is conditioned on an auxiliary dialogue act feature and a controlling gate to over-generate candidate utterances for subsequent reranking. In order to account for arbitrary slot-value pairs that cannot be routinely delexicalised in our corpus, Section 3.1 describes a convolutional neural network (CNN) (Collobert and Weston, 2008; Kalchbrenner et al., 2014) sentence model which is used to validate the semantic consistency of candidate utterances during reranking. Finally, by adding a backward RNNLM reranker into the model in Section 3.2, output fluency is further improved. Training and decoding details of the proposed system are described in Sections 3.3 and 3.4.

Section 4 presents an evaluation of the proposed system in the context of an application providing information about restaurants in the San Francisco area. In Section 4.2, we first show that the new generator outperforms Oh and Rudnicky (2000)'s utterance class LM approach using objective metrics, whilst at the same time being more computationally efficient. In order to assess the subjective performance of our system, pairwise preference tests are presented in Section 4.3. The results show that our approach can produce high quality utterances that are considered to be more natural than those of a rule-based generator. Moreover, by sampling utterances from the top reranked output, our system can also generate linguistically varied utterances. Section 4.4 provides a more detailed analysis of the contribution of each component of the system to the final performance. We conclude with a brief summary and future work in Section 5.

2 Related Work 

Conventional approaches to NLG typically divide the task into sentence planning and surface realisation. Sentence planning maps input semantic symbols into an intermediary tree-like or template structure representing the utterance, then surface realisation converts the intermediate structure into the final text (Walker et al., 2002; Stent et al., 2004; Dethlefs et al., 2013). As noted above, one of the first statistical NLG methods that requires almost no handcrafting or semantic alignments was an n-gram based approach by Oh and Rudnicky (2000). Ratnaparkhi (2002) later addressed the limitations of n-gram LMs in the over-generation phase by using a more sophisticated generator based on a syntactic dependency tree.

Statistical approaches have also been studied for sentence planning, for example, generating the most likely context-free derivations given a corpus (Belz, 2008) or maximising the expected reward using reinforcement learning (Rieser and Lemon, 2010). Angeli et al. (2010) train a set of log-linear models to predict individual generation decisions given the previous ones, using only domain-independent features. Along similar lines, by casting NLG as a template extraction and reranking problem, Kondadadi et al. (2013) show that outputs produced by an SVM reranker are comparable to human-authored texts.

The use of neural network-based approaches to NLG is relatively unexplored. The stock reporter system ANA by Kukich (1987) is a network based NLG system, in which the generation task is divided into a sememe-to-morpheme network followed by a morpheme-to-phrase network. Recent advances in recurrent neural network-based language models (RNNLM) (Mikolov et al., 2010; Mikolov et al., 2011a) have demonstrated the value of distributed representations and the ability to model arbitrarily long dependencies for both speech recognition and machine translation tasks. Sutskever et al. (2011) describe a simple variant of the RNN that can generate meaningful sentences by learning from a character-level corpus. More recently, Karpathy and Fei-Fei (2014) have demonstrated that an RNNLM is capable of generating image descriptions by conditioning the network model on a pre-trained convolutional image feature representation. This work provides a key inspiration for the system described here. Zhang and Lapata (2014) describe interesting work using RNNs to generate Chinese poetry.

A specific requirement of NLG for dialogue systems is that the concepts encoded in the abstract system dialogue act must be conveyed accurately by the generated surface utterance, and simple unconstrained RNNLMs which rely on embedding at the word level (Mikolov et al., 2013; Pennington et al., 2014) are rather poor at this. As a consequence, new methods have been investigated to learn distributed representations for phrases and even sentences by training models using different structures (Collobert and Weston, 2008; Socher et al., 2013). Convolutional Neural Networks (CNNs) were first studied in computer vision for object recognition (Lecun et al., 1998). By stacking several convolutional-pooling layers followed by a fully connected feed-forward network, CNNs are claimed to be able to extract several levels of translational-invariant features that are useful in classification tasks. The convolutional sentence model (Kalchbrenner et al., 2014; Kim, 2014) adopts the same methodology but collapses the two dimensional convolution and pooling process into a single dimension. The resulting model is claimed to represent the state-of-the-art for many speech and NLP related tasks (Kalchbrenner et al., 2014; Sainath et al., 2013).

3 Recurrent Generation Model 


[Figure 1 diagram: the dialogue act Inform(name=Seven_Days, food=Chinese), its 1-hot dialogue act representation (0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, ...), and the delexicalisation step.]

Figure 1: An unrolled view of the RNN-based generation model. It operates on a delexicalised utterance and a 1-hot encoded feature vector specified by a dialogue act type and a set of slot-value pairs. ⊗ indicates the gate used for controlling the on/off states of certain feature values. The output connection layer is omitted here for simplicity.

The generation model proposed in this paper is based on an RNNLM architecture (Mikolov et al., 2010) in which a 1-hot encoding $w_t$ of a token² is input at each time step $t$, conditioned on a recurrent hidden layer, and the network outputs the probability distribution of the next token $w_{t+1}$. Therefore, by sampling input tokens one by one from the output distribution of the RNN until a stop sign is generated (Karpathy and Fei-Fei, 2014) or some required constraint is satisfied (Zhang and Lapata, 2014), the network can produce a sequence of tokens which can be lexicalised to form the required utterance.

²We use token instead of word because our model operates on text for which slot names and values have been delexicalised.

In order to ensure that the generated utterance represents the intended meaning, the input vectors $w_t$ are augmented by a control vector $f$ constructed from the concatenation of 1-hot encodings of the required dialogue act and its associated slot-value pairs. The auxiliary information provided by this control vector tends to decay over time because of the vanishing gradient problem (Mikolov and Zweig, 2012; Bengio et al., 1994). Hence, $f$ is reapplied to the RNN at every time step as in Karpathy and Fei-Fei (2014).

In detail, the recurrent generator shown in Figure 1 is defined as follows:

$$h_t = \mathrm{sigmoid}(W_{hh} h_{t-1} + W_{wh} w_t + W_{fh} f_t) \tag{1}$$

$$P(w_{t+1} \mid w_t, w_{t-1}, \ldots, w_0, f_t) = \mathrm{softmax}(W_{ho} h_t) \tag{2}$$

$$w_{t+1} \sim P(w_{t+1} \mid w_t, w_{t-1}, \ldots, w_0, f_t) \tag{3}$$

where $W_{hh}$, $W_{wh}$, $W_{fh}$, and $W_{ho}$ are the learned network weight matrices. $f_t$ is a gated version of $f$ designed to discourage duplication of information in the generated output, in which each segment $f_s$ of the control vector $f$ corresponding to slot $s$ is replaced by

$$f_{s,t} = f_s \odot \delta^{\,t - t_s} \tag{4}$$

where $t_s$ is the time at which slot $s$ first appears in the output, $\delta \le 1$ is a decay factor, and $\odot$ denotes element-wise multiplication. The effect of this gating is to decrease the probability of regenerating slot symbols that have already been generated, and to increase the probability of rendering all of the information encoded in $f$.
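To make the recurrence and the feature gate concrete, the following pure-Python sketch implements one generator step and the decay of a slot's features. It is an illustration under stated assumptions rather than the authors' code: the helper names, the segment layout of the control vector, and the choice to leave a slot's features untouched until the slot has first appeared are all hypothetical.

```python
import math


def gate_features(f, slot_segments, first_seen, t, delta):
    """Scale the segment of f belonging to slot s by delta ** (t - t_s).

    Assumption: segments of slots that have not appeared yet are left as-is.
    """
    f_t = list(f)
    for slot, (start, end) in slot_segments.items():
        t_s = first_seen.get(slot)
        if t_s is None:
            continue  # slot not generated yet, keep its features fully on
        scale = delta ** (t - t_s)
        for i in range(start, end):
            f_t[i] = f[i] * scale
    return f_t


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def rnn_step(W_hh, W_wh, W_fh, W_ho, h_prev, w_t, f_t):
    """One step: sigmoid hidden update, then softmax over the next token."""
    pre = [a + b + c for a, b, c in
           zip(matvec(W_hh, h_prev), matvec(W_wh, w_t), matvec(W_fh, f_t))]
    h_t = [1.0 / (1.0 + math.exp(-v)) for v in pre]
    return h_t, softmax(matvec(W_ho, h_t))


# Toy control vector: the last two entries belong to a hypothetical "food"
# slot that first appeared in the output at t_s = 2.
f = [1.0, 1.0, 1.0]
segments = {"name": (0, 1), "food": (1, 3)}
f_t = gate_features(f, segments, first_seen={"food": 2}, t=4, delta=0.5)
print(f_t)  # [1.0, 0.25, 0.25]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero weights the hidden units are 0.5 and the distribution over a
# 4-token vocabulary is uniform.
h_t, p_next = rnn_step(zeros(2, 2), zeros(2, 4), zeros(2, 3), zeros(4, 2),
                       h_prev=[0.0, 0.0], w_t=[0.0, 1.0, 0.0, 0.0], f_t=f_t)
```

Setting `delta=1` leaves the control vector unchanged, while `delta=0` switches a slot's features off one step after the slot is generated.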

The tokenisation resulting from delexicalising slots and values does not work for all cases. For example, some slot-value pairs such as food=dont-care or kids.allowed=false cannot be directly modelled using this technique because there is no explicit value to delexicalise in the training corpus. As a consequence, the model is prone to errors when these slot-value pairs are required. A further problem is that the RNNLM generator selects words based only on the preceding history, whereas some sentence forms depend on the backward context.

Figure 2: Our simple variant of the CNN sentence model described in Kalchbrenner et al. (2014).


To deal with these issues, candidates generated by the RNNLM are reranked using two models. Firstly, a convolutional neural network (CNN) sentence model (Kalchbrenner et al., 2014; Kim, 2014) is used to ensure that the required dialogue act and slot-value pairs are represented in the generated utterance, including the non-standard cases. Secondly, a backward RNNLM is used to rerank utterances presented in reverse order.

3.1 Convolutional Sentence Model 

The CNN sentence model is shown in Figure 2. Given a candidate utterance of length $n$, an utterance matrix $U$ is constructed by stacking embeddings $w_t$ of each token in the utterance:

$$U = \begin{bmatrix} - \; w_0 \; - \\ - \; w_1 \; - \\ \vdots \\ - \; w_{n-1} \; - \end{bmatrix} \tag{5}$$

A set of $K$ convolutional mappings are then applied to the utterance to form a set of feature detectors. The outputs of these detectors are combined and fed into a fully-connected feed-forward network to classify the action type and whether each required slot is mentioned or not.

Each mapping $k$ consists of a one-dimensional convolution between a filter $m_k \in \mathbb{R}^m$ and the utterance matrix $U$ to produce another matrix $C^k$:

$$C^k_{i,j} = m_k^{\top} U_{i - \lfloor m/2 \rfloor \,:\, i + \lfloor m/2 \rfloor,\; j} \tag{6}$$

where $m$ is the filter size, and $i$, $j$ are the row and column index respectively. The outputs of each column of $C^k$ are then pooled by averaging³ over time:

$$h_k = \left[ \bar{C}^k_{:,0}, \bar{C}^k_{:,1}, \ldots, \bar{C}^k_{:,h-1} \right] \tag{7}$$

where $h$ is the size of embedding and $k = 1 \ldots K$. Last, the $K$ pooled feature vectors are passed through a nonlinearity function to obtain the final feature map.
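The convolution and average pooling of Equations (6) and (7) can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' implementation: it assumes an odd filter size, a window centred on row i, and zero padding at the utterance boundaries, none of which is stated in the text.

```python
def conv1d_over_time(U, m_k):
    """Convolve filter m_k along the time axis of U, one column at a time.

    U is a list of n token embeddings (rows), each of size h.
    Assumptions: odd filter size, window centred on row i, zero padding.
    """
    n, h = len(U), len(U[0])
    half = len(m_k) // 2
    C = [[0.0] * h for _ in range(n)]
    for i in range(n):
        for j in range(h):
            total = 0.0
            for offset, weight in enumerate(m_k):
                row = i - half + offset
                if 0 <= row < n:  # rows outside the utterance count as zero
                    total += weight * U[row][j]
            C[i][j] = total
    return C


def average_pool_over_time(C):
    """Average each column of C over time, giving one vector of size h."""
    n = len(C)
    return [sum(row[j] for row in C) / n for j in range(len(C[0]))]


# Three tokens with 2-dimensional embeddings and a filter of size m = 3.
U = [[1.0, 0.0],
     [2.0, 1.0],
     [3.0, 0.0]]
C = conv1d_over_time(U, [1.0, 1.0, 1.0])
print(C)  # [[3.0, 1.0], [6.0, 1.0], [5.0, 1.0]]
pooled = average_pool_over_time(C)
```

Because the pooling runs over time, the pooled vector has the same size as the embedding regardless of utterance length, which is what lets a fixed feed-forward classifier sit on top of it.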

3.2 Backward RNN reranking 

As noted earlier, the quality of an RNN language model may be improved if both forward and backward contexts are considered. Previously, bidirectional RNNs (Schuster and Paliwal, 1997) have been shown to be effective for handwriting recognition (Graves et al., 2008), speech recognition (Graves et al., 2013), and machine translation (Sundermeyer et al., 2014). However, applying a bidirectional RNN directly in our generator is not straightforward since the generation process is sequential in time. Hence, instead of integrating the bidirectional information into a single unified network, the forward and backward contexts are utilised separately by firstly generating candidates using the forward RNN generator, then using the log-likelihood computed by a backward RNNLM to rerank the candidates.
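The idea of scoring candidates right-to-left can be shown with a small sketch. To keep it self-contained, a smoothed bigram model stands in for the backward RNNLM; this substitution, the boundary symbols, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter


def train_backward_bigram(corpus):
    """Count bigrams over token sequences read right-to-left.

    A bigram LM is used here only as a stand-in for the backward RNNLM.
    """
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for tokens in corpus:
        rev = ["</s>"] + tokens[::-1] + ["<s>"]
        vocab.update(rev)
        for prev, cur in zip(rev, rev[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams, len(vocab)


def backward_cost(tokens, model):
    """Negative log-likelihood of the utterance presented in reverse order."""
    bigrams, unigrams, vocab_size = model
    rev = ["</s>"] + tokens[::-1] + ["<s>"]
    cost = 0.0
    for prev, cur in zip(rev, rev[1:]):
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        cost -= math.log(p)
    return cost


corpus = [
    ["SLOT_NAME", "serves", "SLOT_FOOD", "food"],
    ["SLOT_NAME", "is", "a", "SLOT_FOOD", "restaurant"],
]
model = train_backward_bigram(corpus)

candidates = [
    ["food", "SLOT_FOOD", "serves", "SLOT_NAME"],  # scrambled word order
    ["SLOT_NAME", "serves", "SLOT_FOOD", "food"],  # well formed
]
ranked = sorted(candidates, key=lambda c: backward_cost(c, model))
print(" ".join(ranked[0]))  # SLOT_NAME serves SLOT_FOOD food
```

Generation itself stays left-to-right; the backward model is applied only afterwards, to complete candidates, which is why it can be kept as a separate network.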

3.3 Training 

Overall the proposed generation architecture requires three models to be trained: a forward RNN generator, a CNN reranker, and a backward RNN reranker. The objective functions for training the two RNN models are the cross entropy errors between the predicted word distribution and the actual word distribution in the training corpus, whilst the objective for the CNN model is the cross entropy error between the predicted dialogue act and the actual dialogue act, summed over the act type and each slot. An $\ell_2$ regularisation term is added to the objective function for every 10 training examples as suggested in Mikolov et al. (2011b). The three networks share the same set of word embeddings, initialised with pre-trained word vectors provided by Pennington et al. (2014). All costs and gradients are computed and stochastic gradient descent is used to optimise the parameters. Both RNNs were trained with back propagation through time (Werbos, 1990). In order to prevent overfitting, early stopping was implemented using a held-out validation set.

³Max pooling was also tested but was found to be inferior to average pooling.

3.4 Decoding 

The decoding procedure is split into two phases: (a) over-generation, and (b) reranking. In the over-generation phase, the forward RNN generator, conditioned on the given dialogue act, is used to sequentially generate utterances by random sampling of the predicted next word distributions. In the reranking phase, the Hamming loss $cost_{CNN}$ of each candidate is computed using the CNN sentence model and the log-likelihood $cost_{bRNN}$ is computed using the backward RNN. Together with the log-likelihood $cost_{fRNN}$ from the forward RNN, the reranking score $R$ is computed as:

$$R = -(cost_{fRNN} + cost_{bRNN} + cost_{CNN}) \tag{8}$$

This is the reranking criterion used to analyse each individual model in Section 4.4.

Generation quality can be further improved by introducing a slot error criterion ERR, which is the number of generated slots that are either redundant or missing. This is also used in Oh and Rudnicky (2000). Adding this to Equation (8) yields the final reranking score $R^*$:

$$R^* = -(cost_{fRNN} + cost_{bRNN} + cost_{CNN} + \lambda\,\mathrm{ERR}) \tag{9}$$

In order to severely penalise nonsensical utterances, $\lambda$ is set to 100 for both the proposed RNN system and our implementation of Oh and Rudnicky (2000)'s n-gram based system. This reranking criterion is used for both the automatic evaluation in Section 4.2 and the human evaluation in Section 4.3.
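The reranking score with the slot error penalty can be sketched as follows. This is a hedged illustration: the `SLOT_` symbol convention, the function names, and the decision to count repeated slot symbols as redundant are assumptions, and only slots that can be delexicalised are covered.

```python
def slot_error(template, required_slots):
    """ERR: required slots that are missing plus slot symbols that are redundant.

    Assumption: a slot symbol is redundant if it was not requested or if it
    repeats a symbol that was already generated.
    """
    generated = [tok for tok in template.split() if tok.startswith("SLOT_")]
    missing = sum(1 for s in required_slots if s not in generated)
    unrequested = sum(1 for s in set(generated) if s not in required_slots)
    repeats = len(generated) - len(set(generated))
    return missing + unrequested + repeats


def rerank_score(cost_frnn, cost_brnn, cost_cnn, err, lam=100.0):
    """Final reranking score: negated sum of the three costs plus lam * ERR."""
    return -(cost_frnn + cost_brnn + cost_cnn + lam * err)


required = {"SLOT_NAME", "SLOT_FOOD"}
good = "SLOT_NAME serves SLOT_FOOD food"
bad = "SLOT_NAME SLOT_NAME is near SLOT_AREA"

print(slot_error(good, required))  # 0
print(slot_error(bad, required))   # 3 (one missing, one unrequested, one repeat)

# The cost values below are made up purely to show the effect of the penalty:
# a lower-cost candidate with one slot error still ranks below an error-free one.
print(rerank_score(2.0, 3.0, 0.0, err=0))  # -5.0
print(rerank_score(1.0, 1.0, 0.0, err=1))  # -102.0
```

With a weight of 100, a single slot error outweighs any realistic difference in the likelihood terms, so the penalty acts almost like a hard filter on candidates.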


4 Experiments 

4.1 Experimental Setup 

The target application area for our generation system is a spoken dialogue system providing information about restaurants in San Francisco. There are 8 system dialogue act types such as inform to present information about restaurants, confirm to check that a slot value has been recognised correctly, and reject to advise that the user's constraints cannot be met (Table 1 gives the full list with examples); and there are 12 attributes (slots): name, count, food, near, price, pricerange, postcode, phone, address, area, goodformeal, and kidsallowed, in which all slots are categorical except kidsallowed, which is binary.

To form a training corpus, dialogues from a set of 3577 dialogues collected in a user trial of a statistical dialogue manager proposed by Young et al. (2013) were randomly sampled and shown to workers recruited via the Amazon Mechanical Turk service. Workers were shown each dialogue turn by turn and asked to enter an appropriate system response in natural English corresponding to each system dialogue act. The resulting corpus contains 5193 hand-crafted system utterances from 1006 randomly sampled dialogues. Each categorical value was replaced by a token representing its slot, and slots that appeared multiple times in a dialogue act were merged into one. This resulted in 228 distinct dialogue acts.

The system was implemented using the Theano library (Bergstra et al., 2010; Bastien et al., 2012). The system was trained by partitioning the 5193 utterances into a training set, validation set, and testing set in the ratio 3:1:1, respectively. The frequency of each action type and slot-value pair differs quite markedly across the corpus, hence up-sampling was used to make the corpus more uniform. Since our generator works stochastically and the trained networks can differ depending on the initialisation, all the results shown below⁴ were averaged over 10 randomly initialised networks. The BLEU-4 metric was used for the objective evaluation (Papineni et al., 2002). Multiple references for each test dialogue act were obtained by mapping them back to the 228 distinct dialogue acts, merging those delexicalised templates that have the same dialogue act specification, and then lexicalising those templates back to

⁴Except for the human evaluation, in which only one set of networks was used.



Table 1: The 8 system dialogue acts with example realisations

#  Dialogue act and example realisations of our system, by sampling from top-5 candidates

1  inform(name="stroganoff restaurant", pricerange=cheap, near="fishermans wharf")
   stroganoff restaurant is a cheap restaurant near fishermans wharf.
   stroganoff restaurant is in the cheap price range near fishermans wharf.

2  reject(kidsallowed=yes, food="basque")
   unfortunately there are 0 restaurants that allow kids and serve basque .

3  informonly(name="bund shanghai restaurant", food="shanghainese")
   i apologize , no other restaurant except bund shanghai restaurant that serves shanghainese .
   sorry but there is no place other than the restaurant bund shanghai restaurant for shanghainese .

4  confirm(goodformeal=dontcare)
   i am sorry . just to confirm . you are looking for a restaurant good for any meal ?
   can i confirm that you do not care about what meal they offer ?

5  request(near)
   would you like to dine near a particular location ?

6  reqmore()
   is there anything else i can do for you ?

7  select(kidsallowed=yes, kidsallowed=no)
   are you looking for a restaurant that allows kids , or does not allow kids ?

8  goodbye()
   thank you for calling . good bye .


Table 2: Comparison of top-1 utterance between the RNN-based system and three baselines. A two-tailed Wilcoxon rank sum test was applied to compare the RNN model with the best O&R system (the 3-slot, 5g configuration) over 10 random seeds. (*=p<.005)

Method           beam   BLEU     ERR
handcrafted      n/a    0.440    0
kNN              n/a    0.591    17.2
O&R,0-slot,5g    1/20   0.527    635.2
O&R,1-slot,5g    1/20   0.610    460.8
O&R,2-slot,5g    1/20   0.719    142.0
O&R,3-slot,3g    1/20   0.760    74.4
O&R,3-slot,4g    1/20   0.758    53.2
O&R,3-slot,5g    1/20   0.757    47.8
Our Model        1/20   0.777*   0*


form utterances. In addition, the slot error (ERR) 
as described in Section 3.4, out of 1848 slots in 
1039 testing examples, was computed alongside 
the BLEU score. 

4.2 Empirical Comparison 

As can be seen in Table 2, we compare our proposed RNN-based method with three baselines: a handcrafted generator, a k-nearest neighbour method (kNN), and Oh and Rudnicky (2000)'s n-gram based approach (O&R). The handcrafted generator was tuned over a long period of time and has been used frequently to interact with real users. We found its performance is reliable and robust. The kNN was performed by computing the similarity of the testing dialogue act 1-hot vector against all training examples. The most similar template in the training set was then selected and lexicalised as the testing realisation. We found our RNN generator significantly outperforms these two approaches. When comparing with the O&R system, we found that by partitioning the corpus into more and more utterance classes, the O&R system can also reach a BLEU score of 0.76. However, the slot error cannot be efficiently reduced to zero even when using the error itself as a reranking criterion. This problem is also noted in Mairesse and Young (2014).

Figure 3: Comparison of our method (rnn) with O&R's approach (5g) in terms of optimising top-5 results over different selection beams.
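The kNN baseline described above amounts to a nearest-neighbour lookup over 1-hot dialogue act vectors. The sketch below is a minimal illustration; the text does not name the similarity measure, so cosine similarity is an assumption, as are the toy vectors and the function names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_template(test_vector, training_examples):
    """Return the template whose dialogue act vector is most similar.

    training_examples is a list of (one_hot_vector, delexicalised_template).
    """
    best = max(training_examples, key=lambda ex: cosine(test_vector, ex[0]))
    return best[1]


# Hypothetical 4-dimensional dialogue act encodings.
training = [
    ([1, 0, 1, 0], "SLOT_NAME serves SLOT_FOOD food"),
    ([0, 1, 0, 0], "is there anything else i can do for you ?"),
]
print(nearest_template([1, 0, 1, 1], training))
# SLOT_NAME serves SLOT_FOOD food
```

The selected template would then be lexicalised with the slot values of the test dialogue act, which explains why this baseline produces slot errors whenever the nearest training act has a different set of slots.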

In contrast, the RNN system produces utterances without slot errors when reranking using the same number of candidates, and it achieves the highest BLEU score. Figure 3 compares the RNN system with O&R's system when randomly selecting from the top-5 ranked results in order to introduce linguistic diversity. Results suggest that although O&R's approach improves as the selection beam increases, the RNN-based system is still better in both metrics. Furthermore, the slot error of the RNN system drops to zero when the selection beam is around 50. This indicates that the RNN system is capable of generating paraphrases by simply increasing the number of candidates during the over-generation phase.

Table 3: Pairwise comparison between four systems. Two quality evaluations (rating out of 5) and one preference test were performed in each case. Statistical significance was computed using a two-tailed Wilcoxon rank sum test and a two-tailed binomial test (*=p<.05, **=p<.005).

Metrics   handcrafted  RNN₁      handcrafted  RNN₅      RNN₁    RNN₅     O&R₅    RNN₅
          148 dialogs, 829 utt.  148 dialogs, 814 utt.  144 dialogs, 799 utt.  145 dialogs, 841 utt.
Info.     3.75         3.81      3.85         3.93*     3.75    3.72     4.02    4.15*
Nat.      3.58         3.74**    3.57         3.94**    3.67    3.58     3.91    4.02
Pref.     44.8%        55.2%*    37.2%        62.8%**   47.5%   52.5%    47.1%   52.9%
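The top-5 selection scheme discussed in Section 4.2 can be sketched as a random choice among the best-scoring candidates. The function name, the data layout, and the example scores are assumptions for illustration; higher scores are taken to be better, as for the reranking score R.

```python
import random


def sample_from_top_n(scored_candidates, n=5, rng=random):
    """Pick one utterance uniformly at random from the n best-scoring candidates.

    scored_candidates is a list of (utterance, score) pairs.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    utterance, _score = rng.choice(ranked[:n])
    return utterance


# Made-up scores: the last candidate carries a slot error penalty.
scored = [
    ("SLOT_NAME serves SLOT_FOOD food", -4.2),
    ("SLOT_NAME is a SLOT_FOOD restaurant", -4.9),
    ("SLOT_NAME serves food", -104.0),
]

top1 = sample_from_top_n(scored, n=1)  # always the best candidate
varied = sample_from_top_n(scored, n=2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the selection reproducible, which is useful when comparing systems over several runs.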

4.3 Human Evaluation 

Whilst automated metrics provide useful information for comparing different systems, human testing is needed to assess subjective quality. To do this, about 60 judges were recruited using Amazon Mechanical Turk and system responses were generated for the remaining 2571 unseen dialogues mentioned in Section 4.1. Each judge was then shown a randomly selected dialogue, turn by turn. At each turn, two utterances were generated from two different systems and presented to the judge, who was asked to score each utterance in terms of informativeness and naturalness (rating out of 5), and also asked to state a preference between the two, taking account of the given dialogue act and the dialogue context. Here informativeness is defined as whether the utterance contains all the information specified in the dialogue act, and naturalness is defined as whether the utterance could have been produced by a human. The trial was run pairwise across four systems: the RNN system using the 1-best utterance (RNN₁), the RNN system sampling from the top 5 utterances (RNN₅), the O&R approach sampling from the top 5 utterances (O&R₅), and a handcrafted baseline.

The results are shown in Table 3. As can be seen, the human judges preferred both RNN₁ and RNN₅ compared to the rule-based generator and the preference is statistically significant. Furthermore, the RNN systems scored higher in both informativeness and naturalness metrics, though the difference for informativeness is not statistically significant. When comparing RNN₁ with RNN₅, RNN₁ was judged to produce higher quality utterances but overall the diversity of output offered by RNN₅ made it the preferred system. Even though the preference is not statistically significant, it echoes previous findings (Pon-Barry et al., 2006; Mairesse and Young, 2014) that showed that language variability by paraphrasing in dialogue systems is generally beneficial. Lastly, RNN₅ was thought to be significantly better than O&R in terms of informativeness. This result verified our findings in Section 4.2 that O&R suffers from high slot error rates compared to the RNN system.

4.4 Analysis 

In order to better understand the relative contribution of each component in the RNN-based generation process, a system was built in stages, training first only the forward RNN generator, then adding the CNN reranker, and finally the whole model including the backward RNN reranker. Utterance candidates were reranked using Equation (8) rather than (9) to minimise manual intervention. As previously, the BLEU score and slot error (ERR) were measured.

Gate. The forward RNN generator was trained first with different feature gating factors $\delta$. Using a selection beam of 20 and selecting the top 5 utterances, the result is shown in Figure 4 for $\delta=1$ (equivalent to not using the gate), $\delta=0.7$, and $\delta=0$ (equivalent to turning off the feature immediately after its corresponding slot has been generated). As can be seen, use of the feature gating substantially improves both BLEU score and slot error, and the best performance is achieved by setting $\delta=0$.

CNN. The feature-gated forward RNN generator was then extended by adding a single convolutional-pooling layer CNN reranker. As shown in Figure 5, evaluation was performed on both the original dataset (all) and the dataset containing only binary slots and don't care values (hard). We found that the CNN reranker can better handle slots and values that cannot be explicitly delexicalised (1.5% improvement on hard compared to 1% less on all).

Figure 4: Feature gating effect

Figure 5: CNN effect

Figure 6: Backward RNN effect

Backward RNN. Lastly, the backward RNN reranker was added and trained to give the full generation model. The selection beam was fixed at 100 and the number n of top-ranked results from which to select the output utterance was varied as n = 1, 5 and 10, trading accuracy for linguistic diversity. In each case, the BLEU score was computed with and without the backward RNN reranker. The results shown in Figure 6 are consistent with Section 4.2, in which BLEU score degraded as more n-best utterances were chosen. As can be seen, the backward RNN reranker provides a stable improvement regardless of the value of n.

Training corpus size. Finally, Figure 7 shows the effect of varying the size of the training corpus. As can be seen, if only the 1-best utterance is offered to the user, then around 50% of the data (2000 utterances) is sufficient. However, if the linguistic variability provided by sampling from the top-5 utterances is required, then the figure suggests that more than the 4156 utterances in the current training set are required.



Figure 7: Networks trained with different proportions of data, evaluated on two selection schemes.


5 Conclusion and Future Work 

In this paper a neural network-based natural language generator has been presented in which a forward RNN generator, a CNN reranker, and a backward RNN reranker are jointly optimised to generate utterances conditioned by the required dialogue act. The model can be trained on any corpus of dialogue act-utterance pairs without any semantic alignment, heavy feature engineering, or handcrafting. The RNN-based generator is compared with an n-gram based generator which uses similar information. The n-gram generator can achieve similar BLEU scores but it is less efficient and prone to making errors in rendering all of the information contained in the input dialogue act.

An evaluation by human judges indicated that our system can produce not only high quality but also linguistically varied utterances. The latter is particularly important in spoken dialogue systems where frequent repetition of identical output forms can rapidly become tedious.

The work reported in this paper is part of a larger programme to develop techniques for implementing open domain spoken dialogue. A key potential advantage of neural network based language processing is the implicit use of distributed representations for words and a single compact parameter encoding of a wide range of syntactic/semantic forms. This suggests that it should be possible to transfer a well-trained generator of the form proposed here to a new domain using a much smaller set of adaptation data. This will be the focus of our future work in this area.